be confident in cleaning datasets, preparing them for analysis
understand how to manipulate variables, and create new variables, in preparation for analysis
7.2 Introduction
A lot of the data you’ll deal with in sport is ‘messy’. That is, it may not land on your screen in a format that is immediately ready for analysis.
Therefore, some of the key steps we need to take include:
making sure we can read and write the data files;
making sure the variables are named consistently and accurately;
dealing with missing data;
dealing with outliers;
making sure each variable type is defined correctly;
we have all the variables that we need to conduct our analysis.
The following sections describe how a number of these steps can be achieved in R. From my experience, this stage is likely to be the most time-consuming (and frustrating) part of any data analysis, so it’s worth making sure you fully understand the commands and processes outlined below.
7.3 The ‘tidyverse’ package
tidyverse is a collection of R packages designed for data manipulation, exploration, and visualisation. Make sure you have it installed.
As noted previously, there are lots of different ways to do the same thing in R. We’re going to stick to using the tidyverse library.
Two of the core packages within tidyverse, readr and dplyr, provide really useful functions for reading and writing data in various formats.
You encountered one part of the tidyverse package, tibbles, in the previous reading (‘Data Structures in R’). However, there are a lot of other functions within this package that make it well worth learning.
7.4 Reading and Writing Data using tidyverse
Reading Data - readr
readr is a package within Tidyverse that provides functions to read data from common file formats, including CSV, TSV, and fixed-width files. It’s designed for fast and efficient data reading with user-friendly parsing and type conversion.
To read data from a CSV file, we can use the read_csv() function:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <-'https://www.dropbox.com/scl/fi/6pctzyujt0mgvxvxbtmvw/ah_data_02.csv?rlkey=fqbc76dlqzp22ie6qm3urd3jv&dl=1'# Import a CSV filecsv_data <-read_csv(url)
Rows: 26 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Team
dbl (9): Pos, Pl, W, D, L, F, A, GD, Pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(csv_data)
# A tibble: 6 × 10
Pos Team Pl W D L F A GD Pts
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Arsenal 30 23 4 3 72 29 43 73
2 2 Manchester City 29 21 4 4 75 27 48 67
3 3 Newcastle United 29 15 11 3 48 21 27 56
4 4 Manchester United 29 17 5 42 44 37 7 56
5 5 Tottenham Hotspur 30 16 5 9 55 42 13 53
6 6 Aston Villa 60 14 5 11 41 40 1 47
For other file formats, we can use the corresponding tidyverse read functions, such as read_tsv() for TSV files or read_fwf() for fixed-width files.
Writing Data - readr
readr also provides functions for writing data to common file formats, like CSV and TSV. For example, the write_csv() function can be used to write a data frame to a CSV file:
# Write data to a CSV filewrite_csv(csv_data, "output.csv")
Similarly, we can use the write_tsv() function to write data to a TSV file.
7.5 Data cleaning and transformation
dplyr is another core package within tidyverse that provides a set of tools for data manipulation, such as filtering, selecting, and summarising data.
The following procedures are some of the most common ones you’ll want to use when preparing your data prior to analysis.
Remember to keep using your environment window to check what’s being produced by these code snippets.
Filtering rows with a specific value, and selecting specific columns
# Load the required packageslibrary(tidyverse)# create a sample tibbledata <-tibble(id =1:5,category =c("A", "B", "A", "B", "A"),value =c(23, 45, 12, 78, 37))# Filter rows with category 'A' and select columns 'id' and 'value'filtered_data <- data %>%filter(category =="A") %>%select(id, value)print(filtered_data)
# A tibble: 3 × 2
id value
<int> <dbl>
1 1 23
2 3 12
3 5 37
Selecting specific columns
# Select columns by nameselected_data <- data %>%select(id, category, value)
Sorting your data by column values
# Sort data in ascending ordersorted_data <- data %>%arrange(value)# Sort data in descending ordersorted_data <- data %>%arrange(desc(value))
Creating new columns
# Add a new column with calculated valuesnew_data <- data %>%mutate(new_column = id * value)
Grouping data, and performing aggregations
# Group data by a column and calculate the mean of another columngrouped_data <- data %>%group_by(category) %>%summarize(mean_value =mean(value))
Combining reading, writing, and manipulating data
You can chain together reading, manipulating, and writing data using the %\>% pipe operator from tidyverse.
This allows for a more readable and efficient workflow.
In the following example, we read data from a CSV file, filter it, and then write it back to a new CSV file.
# Read data from a CSV file, filter, and write to a new CSV fileread_csv(url) %>%filter(Pl =="30") %>%select(Team, W) %>%write_csv("filtered_data.csv")
Rows: 26 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Team
dbl (9): Pos, Pl, W, D, L, F, A, GD, Pts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you run this code, can you interpret what is happening? Check your files and you should see a new CSV file called ‘filtered_data.csv’.
7.6 Data reshaping with tidyr
tidyr is another core tidyverse package that provides functions for cleaning and reshaping data. It helps create “tidy” data, where each variable is a column, and each observation is a row.
7.7 Gather multiple columns into key-value pairs (wide to long format)
# Gather columns 'column1', 'column2', and 'column3' into key-value pairslong_data <- data %>%gather(key ="variable", value ="value", id, category, value)
7.8 Spread key-value pairs into separate columns (long to wide format)
# Spread key-value pairs in 'id' and 'value' columns into separate columnswide_data <- data %>%spread(key ="id", value ="value")
7.9 Separate a single column into multiple columns
# Separate 'column_name' into two new columns 'column1' and 'column2', splitting by a delimiter (e.g., '-')separated_data <- data %>%separate(value, into =c("column1", "column2"), sep ="-")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 2, 3,
4, 5].
Unite multiple columns into a single column
# Unite columns into a new column 'column_name', with delimiter (e.g., '-')united_data <- data %>%unite(column_name, id, value, sep ="-")
7.10 Alternative Approaches
In the preceding sections, we used the tidyverse package to conduct various operations on a newly-imported dataset.
Before tidyverse was introduced, it was possible to conduct the same operations, using the following code.
Many coders still use these commands, and it is important to be familiar with them even if you don’t use them yourself.
Reading a .csv file
rm(list=ls()) # this code cleans my environment# load librarylibrary(dplyr)# import dataurl <-'https://www.dropbox.com/scl/fi/6pctzyujt0mgvxvxbtmvw/ah_data_02.csv?rlkey=fqbc76dlqzp22ie6qm3urd3jv&dl=1'data <-read.csv(url)head(data) # display the first six rows
Pos Team Pl W D L F A GD Pts
1 1 Arsenal 30 23 4 3 72 29 43 73
2 2 Manchester City 29 21 4 4 75 27 48 67
3 3 Newcastle United 29 15 11 3 48 21 27 56
4 4 Manchester United 29 17 5 42 44 37 7 56
5 5 Tottenham Hotspur 30 16 5 9 55 42 13 53
6 6 Aston Villa 60 14 5 11 41 40 1 47
str(data) # inspect variable types
'data.frame': 26 obs. of 10 variables:
$ Pos : int 1 2 3 4 5 6 7 8 9 10 ...
$ Team: chr "Arsenal" "Manchester City" "Newcastle United" "Manchester United" ...
$ Pl : int 30 29 29 29 30 60 28 29 30 29 ...
$ W : int 23 21 15 17 16 14 13 12 10 11 ...
$ D : int 4 4 11 5 5 5 7 8 999 6 ...
$ L : int 3 4 3 42 9 11 8 9 7 12 ...
$ F : int 72 75 48 44 55 41 52 50 47 39 ...
$ A : int 29 27 21 37 42 40 NA 35 40 40 ...
$ GD : int 43 48 27 7 13 1 16 15 7 -1 ...
$ Pts : int 73 67 56 56 53 47 46 44 43 39 ...
Remove outliers (‘999’)
# Check each row if 999 is present and, if so, replace with 'NA'data[data ==999, ]
Pos Team Pl W D L F A GD Pts
NA NA <NA> NA NA NA NA NA NA NA NA
NA.1 NA <NA> NA NA NA NA NA NA NA NA
NA.2 NA <NA> NA NA NA NA NA NA NA NA
NA.3 NA <NA> NA NA NA NA NA NA NA NA
NA.4 NA <NA> NA NA NA NA NA NA NA NA
NA.5 NA <NA> NA NA NA NA NA NA NA NA
NA.6 NA <NA> NA NA NA NA NA NA NA NA
NA.7 NA <NA> NA NA NA NA NA NA NA NA
NA.8 NA <NA> NA NA NA NA NA NA NA NA
NA.9 NA <NA> NA NA NA NA NA NA NA NA
NA.10 NA <NA> NA NA NA NA NA NA NA NA
NA.11 NA <NA> NA NA NA NA NA NA NA NA
NA.12 NA <NA> NA NA NA NA NA NA NA NA
NA.13 NA <NA> NA NA NA NA NA NA NA NA
NA.14 NA <NA> NA NA NA NA NA NA NA NA
NA.15 NA <NA> NA NA NA NA NA NA NA NA
NA.16 NA <NA> NA NA NA NA NA NA NA NA
NA.17 NA <NA> NA NA NA NA NA NA NA NA
NA.18 NA <NA> NA NA NA NA NA NA NA NA
NA.19 NA <NA> NA NA NA NA NA NA NA NA
NA.20 NA <NA> NA NA NA NA NA NA NA NA
NA.21 NA <NA> NA NA NA NA NA NA NA NA
NA.22 NA <NA> NA NA NA NA NA NA NA NA
NA.23 NA <NA> NA NA NA NA NA NA NA NA
NA.24 NA <NA> NA NA NA NA NA NA NA NA
NA.25 NA <NA> NA NA NA NA NA NA NA NA
NA.26 NA <NA> NA NA NA NA NA NA NA NA
NA.27 NA <NA> NA NA NA NA NA NA NA NA
NA.28 NA <NA> NA NA NA NA NA NA NA NA
NA.29 NA <NA> NA NA NA NA NA NA NA NA
NA.30 NA <NA> NA NA NA NA NA NA NA NA
NA.31 NA <NA> NA NA NA NA NA NA NA NA
NA.32 NA <NA> NA NA NA NA NA NA NA NA
NA.33 NA <NA> NA NA NA NA NA NA NA NA
NA.34 NA <NA> NA NA NA NA NA NA NA NA
NA.35 NA <NA> NA NA NA NA NA NA NA NA
NA.36 NA <NA> NA NA NA NA NA NA NA NA
NA.37 NA <NA> NA NA NA NA NA NA NA NA
NA.38 NA <NA> NA NA NA NA NA NA NA NA
NA.39 NA <NA> NA NA NA NA NA NA NA NA
NA.40 NA <NA> NA NA NA NA NA NA NA NA
NA.41 NA <NA> NA NA NA NA NA NA NA NA
NA.42 NA <NA> NA NA NA NA NA NA NA NA
NA.43 NA <NA> NA NA NA NA NA NA NA NA
NA.44 NA <NA> NA NA NA NA NA NA NA NA
NA.45 NA <NA> NA NA NA NA NA NA NA NA
NA.46 NA <NA> NA NA NA NA NA NA NA NA
NA.47 NA <NA> NA NA NA NA NA NA NA NA
NA.48 NA <NA> NA NA NA NA NA NA NA NA
NA.49 NA <NA> NA NA NA NA NA NA NA NA
NA.50 NA <NA> NA NA NA NA NA NA NA NA
NA.51 NA <NA> NA NA NA NA NA NA NA NA
NA.52 NA <NA> NA NA NA NA NA NA NA NA
NA.53 NA <NA> NA NA NA NA NA NA NA NA
NA.54 NA <NA> NA NA NA NA NA NA NA NA
NA.55 NA <NA> NA NA NA NA NA NA NA NA
NA.56 NA <NA> NA NA NA NA NA NA NA NA
NA.57 NA <NA> NA NA NA NA NA NA NA NA
NA.58 NA <NA> NA NA NA NA NA NA NA NA
Removing rows with missing data
data01 <-na.omit(data) # removes any row (observation) with missing data
Remove a variable ‘A’ from dataset ‘data’
data02<-subset(data01, select =-c(A))
Rename variables in dataset ‘data’ and create a new dataset ‘data02’
data03 <-rename(data02, league_position = Pos, games_played = Pl) # create the new datasetrm(data, data01, data02) # remove original dataset from memory
Change variable types
data03$Team =as.factor(data03$Team) # make team a factor
Create a new logical true/false variable based on existing variables
We can create a TRUE/FALSE variable (logical) based on how many points a team has accrued.
# first, we replace the existing values with TRUE or FALSEdata03$over_50[data03$Pts >=50] <-TRUEdata03$over_50[data03$Pts <50] <-FALSE# then, we change the variable type to logicaldata03$over_50 =as.logical(data03$over_50)# finally, we check out dataset and make sure the variable type has changedhead(data03) # show first six rows
league_position Team games_played W D L F GD Pts over_50
1 1 Arsenal 30 23 4 3 72 43 73 TRUE
2 2 Manchester City 29 21 4 4 75 48 67 TRUE
3 3 Newcastle United 29 15 11 3 48 27 56 TRUE
4 4 Manchester United 29 17 5 42 44 7 56 TRUE
5 5 Tottenham Hotspur 30 16 5 9 55 13 53 TRUE
6 6 Aston Villa 60 14 5 11 41 1 47 FALSE